A Comparison of Two Smoothing Methods for Word Bigram Models

Author

  • Linda C. Bauman Peto
Abstract

Word bigram models estimated from text corpora require smoothing methods to estimate the probabilities of unseen bigrams. The deleted estimation method uses the formula

Pr(i|j) = λ·f_i + (1 − λ)·f_{i|j},

where f_i and f_{i|j} are the relative frequency of i and the conditional relative frequency of i given j, respectively, and λ is an optimized parameter. MacKay (1994) proposes a Bayesian approach using Dirichlet priors, which yields a different formula:

Pr(i|j) = α/(F_j + α)·m_i + (1 − α/(F_j + α))·f_{i|j},

where F_j is the count of j, and α and m_i are optimized parameters. This thesis describes an experiment in which the two methods were trained on a two-million-word corpus taken from the Canadian Hansard and compared on the basis of the experimental perplexity that they assigned to a shared test corpus. The methods proved to be about equally accurate, with MacKay's method using fewer resources.

Acknowledgements

I would like to thank my supervisors, Graeme Hirst and Dekai Wu. Dekai suggested the topic of this thesis and provided technical guidance, over and above his duties at the Hong Kong University of Science and Technology. Graeme coached me through departmental procedures, corrected the writing style, and exhibited patience and optimism. David MacKay helpfully explained the details of his method and advised me regarding the implementation. He also served as the external reader. Bill Gale, Radford Neal, and Peter Brown answered statistical questions. Finally, I would like to thank my family, James and Michael, for their love and encouragement, and God, who makes all things possible.
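As an informal illustration of the two estimators and the perplexity comparison described in the abstract, the following Python fragment computes both smoothed bigram probabilities from raw counts. It is a minimal sketch under stated assumptions, not the thesis implementation: the function and variable names are invented here, the parameters λ, α, and m_i are taken as given (their optimization is omitted), and the small probability floor in the perplexity routine is an added safeguard.

from collections import Counter
import math

def train_counts(tokens):
    # Unigram and bigram counts from a list of word tokens.
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def deleted_estimation_prob(i, j, unigrams, bigrams, total, lam):
    # Pr(i|j) = lam * f_i + (1 - lam) * f_{i|j}
    f_i = unigrams[i] / total
    f_i_given_j = bigrams[(j, i)] / unigrams[j] if unigrams[j] else 0.0
    return lam * f_i + (1 - lam) * f_i_given_j

def dirichlet_prob(i, j, unigrams, bigrams, m, alpha):
    # Pr(i|j) = alpha/(F_j + alpha) * m_i + (1 - alpha/(F_j + alpha)) * f_{i|j},
    # written here in the equivalent single-fraction form.
    F_j = unigrams[j]
    f_i_given_j = bigrams[(j, i)] / F_j if F_j else 0.0
    return (alpha * m[i] + F_j * f_i_given_j) / (F_j + alpha)

def perplexity(test_tokens, prob_fn):
    # Per-word perplexity of a test sequence under a conditional bigram model.
    log_sum, n = 0.0, 0
    for j, i in zip(test_tokens[:-1], test_tokens[1:]):
        p = max(prob_fn(i, j), 1e-12)  # floor to avoid log(0); an added safeguard
        log_sum += math.log(p)
        n += 1
    return math.exp(-log_sum / n)

With counts trained on one corpus, calling perplexity(test_tokens, lambda i, j: dirichlet_prob(i, j, unigrams, bigrams, m, alpha)) and the analogous deleted-estimation call gives the kind of head-to-head comparison the abstract reports.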


Similar papers

The University of Amsterdam at NTCIR-5

We describe the University of Amsterdam’s participation in the Cross-Lingual Information Retrieval task at NTCIR-5. We focused on Chinese monolingual retrieval, and aimed to study the effectiveness of language models and different tokenization methods for Chinese. Our main findings are the following. First, where the vector space model excels on a bigram index, the language model performs poorl...


Chinese Unknown Word Identification Based on Local Bigram Model

This paper presents a Chinese unknown word identification system based on a local bigram model. Generally, our word segmentation system employs a statistics-based unigram model. To identify unknown words, however, we take advantage of their contextual information and apply a bigram model locally. By adjusting the interpolation weight, which is derived from a smoothing method, we combine thes...


An Empirical Study of Smoothing Techniques for Language Modeling

We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative pe...


Aggregate and mixed-order Markov models for statistical language processing

We consider the use of language models whose size and accuracy are intermediate between different order n-gram models. Two types of models are studied in particular. Aggregate Markov models are class-based bigram models in which the mapping from words to classes is probabilistic. Mixed-order Markov models combine bigram models whose predictions are conditioned on different words. Both types of m...
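The aggregate (class-based) bigram idea summarized above amounts to Pr(w2|w1) = Σ_c Pr(w2|c)·Pr(c|w1). The following is a minimal Python sketch of that prediction rule, assuming the two conditional tables have already been estimated (for example by EM); the array names, sizes, and random initialization are illustrative, not taken from the paper.

import numpy as np

V, C = 1000, 16  # illustrative vocabulary and class sizes
rng = np.random.default_rng(0)
p_class_given_prev = rng.dirichlet(np.ones(C), size=V)   # Pr(c | w1), shape V x C
p_word_given_class = rng.dirichlet(np.ones(V), size=C)   # Pr(w2 | c), shape C x V

def aggregate_bigram_prob(w1, w2):
    # Pr(w2 | w1) = sum over classes c of Pr(w2 | c) * Pr(c | w1);
    # the word-to-class mapping is probabilistic, so every class contributes.
    return float(p_class_given_prev[w1] @ p_word_given_class[:, w2])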


N-gram Based Two-Step Algorithm for Word Segmentation

This paper describes an n-gram based reinforcement approach to the closed track of word segmentation in the third Chinese word segmentation bakeoff. Character n-gram features of unigram, bigram, and trigram are extracted from the training corpus and their frequencies are counted. We investigate a step-by-step methodology using these n-gram statistics. In the first step, relatively definite segm...



Journal:
  • CoRR

Volume: abs/cmp-lg/9410034  Issue: -

Pages: -

Publication date: 1994